ftp.cs.arizona.edu

home *** CD-ROM | disk | FTP | other *** search

/ ftp.cs.arizona.edu / ftp.cs.arizona.edu.tar / ftp.cs.arizona.edu / icon / newsgrp / group99a.txt / 000124_icon-group-sender _Fri Jun 4 14:45:35 1999.msg < prev next >

Wrap

Internet Message Format | 2000-09-20 | 5KB

Return-Path: <icon-group-sender> Received: (from root@localhost) by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) id OAA03881 for icon-group-addresses; Fri, 4 Jun 1999 14:41:55 -0700 (MST) Message-Id: <199906042141.OAA03881@baskerville.CS.Arizona.EDU> Delivered-To: icon-group@silliac.cs.arizona.edu Date: Fri, 04 Jun 1999 09:09:29 -0700 From: Steve Wampler <swampler@noao.edu> X-Accept-Language: en To: icon-group@optima.CS.Arizona.EDU Subject: Re: Find the longest matching substrings Errors-To: icon-group-errors@optima.CS.Arizona.EDU Status: RO This is a multi-part message in MIME format. --------------09EB4280837648B91A00FBDF Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit gep2@terabites.com wrote: > > > I am working on a project that will mark the inserted or deleted phases > between two strings. I am wondering whether you know any library > procedure that will anchor ( or report ) the longest phases that exists > in both strings. > > > For example, assume that we have two strings: > old:="other strings ab cd e and more substrings" > new:="some more substrings ab cd ab cd e" > > > This function should return phases and position in both strings, i.e. > [15, 28, "ab cd e"] for "ab cd e" is the longest phase that matched ( > it has three continuous matching elements ). > > If I were going to try to make something like this FAST (and especially if one > of the strings were relatively constant) I'd probably want to use a hash (or > even direct index?) table pre-computed with the offsets of character substrings > (I'd probably use two-character adjacent pairs) in one of those strings; this > would enable stepping rapidly through the second string and quickly finding at > least (only just the) plausible candidates for launching more detailed substring > compare (and length computation) operations. This kind of thing would likely be > storage-hungry, but that's less of a concern today than it once was. Taking Gordon's comment on speed to heart, here's an adaptation of the previous solution that is faster (uses simple heuristics to avoid looking at phrases that can't possibly be longer than ones that are found already). Note that it can probably be sped up even more by improving these heuristics a bit. This version is about 2.5x faster on my test case (the "that that is..." nonsense). Applying this adaptation to my solution that ignores punctuation when comparing phrases results in a speedup of about 25x! (Because the test case strings match each other when ignoring punctuation.) Oh, I also modified the test driver to do time comparisions. Incidently, it would be 'better' if the phrase element procedures were converted into string scanning procedures. Scanning strings for phrase elements sounds useful to me and can considered as a generalization of scanning strings for characters - in fact, the code here is just an adaptation of code to find the longest common substring. It would also be better if the code were rewritten to accept different definitions of a phrase element more easily! -- Steve Wampler- SOLIS Project, National Solar Observatory swampler@noao.edu --------------09EB4280837648B91A00FBDF Content-Type: text/plain; charset=us-ascii; name="phrase3.icn" Content-Disposition: inline; filename="phrase3.icn" Content-Transfer-Encoding: 7bit procedure main(args) n := args[1]|10 start := &time every 1 to n do { result := longPhrase(read(),read()) } len := &time - start write("[",result[1],",",result[2],",\"",result[3],"\"]") write(real(len)/n,"ms per loop for ",n," loops") end # Find longest phrase common to s1 and s2. Fails if no common phrase. # # Assumes simple definition for a phrase (sequence of letters) # Assumes punctuation is to be considered when matching phrases # procedure longPhrase(s1,s2) static pChars initial pChars := &letters # chars legal in phrase element maxlen := 0 every i := pStart(s1, pChars) do { every k := rpStop(s1, i, pChars) do { if j := findPhrase(s1[i:k], s2, pChars) then { if maxlen <:= numPhrases(s1[i:k], pChars) then { saveStart1 := i saveStop1 := k saveStart2 := j } break # no need to find shorter phrases } # that start here } } return [\saveStart1, saveStart2, s1[saveStart1:saveStop1]] end procedure pStart(s, pc) # generate start of all phrases in s local i s ? { while tab(i := upto(pc)) do { tab(many(pc)) suspend i } } end procedure rpStop(s, i, pc) # generates end of all phrases in s, # starting from right of string suspend *s+2 - pStart(reverse(s[i:0]), pc) end procedure numPhrases(s, pc) # Count number of phrases in s local count count := 0 every pStart(s,pc) do count +:= 1 return count end procedure findPhrase(p, s, pc) # find phrase p in s local i s ? { while tab(i := upto(pc)) do { if match(p) then suspend i tab(many(pc)) } } end --------------09EB4280837648B91A00FBDF--